Police shootings have been on the nightly news for as long as TV has existed, yet each new shooting brings everyone to chilling realizations. This study aims to shed some light on such dark themes and equip people with the facts.
This data is a list of fatal police shootings since 2015, as recorded by the Washington Post, that details the deaths of United States citizens at the hands of police officers. It includes the name of each citizen, the manner of death, whether the citizen was armed, age, gender, city, and date.
Due to the personal nature of loss, the research team omitted the names of the deceased out of respect.
Below we display our sessionInfo() for replication purposes.
sessionInfo(package=NULL)
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 htmltools_0.3.5
[5] tools_3.3.3 base64enc_0.1-3 yaml_2.1.14 Rcpp_0.12.10
[9] stringi_1.1.5 rmarkdown_1.4 knitr_1.15.1 jsonlite_1.4
[13] stringr_1.2.0 digest_0.6.12 evaluate_0.10
The data set was originally found on data.world, uploaded by a user named carlvlewis who claims that the data set is updated daily. For the purpose of this study we used the latest version of the data that was available.
The data that was originally obtained was raw and contained private information regarding real people, so the data was cleaned using the lapply and gsub functions. The team removed the names of the deceased and made the race variable clearer. This cleaned data set was uploaded to the project data set on data.world.
To obtain a copy of this data, follow these steps.
The following is a summary of the cleaned data set:
summary(fatalPoliceShootings)
id name date
Min. : 3.0 TK TK : 21 2015-07-07: 8
1st Qu.: 650.2 Brandon Jones : 2 2015-12-14: 8
Median :1203.5 Daquan Antonio Westbrook: 2 2016-01-27: 8
Mean :1204.1 Eric Harris : 2 2016-12-21: 8
3rd Qu.:1768.5 Jamake Cason Thomas : 2 2017-01-24: 8
Max. :2333.0 Michael Johnson : 2 2017-02-03: 8
(Other) :2059 (Other) :2042
manner_of_death armed age gender
shot :1941 gun :1145 Min. : 6.00 F: 86
shot and Tasered: 149 knife : 310 1st Qu.:26.00 M:2004
unarmed : 152 Median :34.00
vehicle : 132 Mean :36.52
undetermined: 98 3rd Qu.:45.00
toy weapon : 89 Max. :86.00
(Other) : 164 NA's :44
race city state signs_of_mental_illness
: 119 Los Angeles: 31 CA : 350 False:1571
A: 31 Phoenix : 24 TX : 191 True : 519
B: 520 Houston : 23 FL : 125
H: 356 Chicago : 22 AZ : 93
N: 28 Las Vegas : 16 CO : 63
O: 28 Austin : 15 OK : 63
W:1008 (Other) :1959 (Other):1205
threat_level flee body_camera
attack :1347 : 34 False:1866
other : 614 Car : 312 True : 224
undetermined: 129 Foot : 245
Not fleeing:1421
Other : 78
The following is the code we used to clean the data. Specifically, we used the ETL to remove the victims' names and to convert the values in the state column from lowercase to uppercase. Then we changed specific values to be more descriptive. For example, we changed each individual's race from one letter (e.g., 'H') to the full word (e.g., 'HISPANIC'). This was done for our convenience, so that we could identify race more quickly. The same was done for the gender column, switching 'F' and 'M' to 'FEMALE' and 'MALE' respectively. Finally, we parse through the dimensions of the data frame to remove symbols that could cause errors later when using the data.
require(readr)
require(plyr)
file_path = "../01 Data/fatal-police-shootings-data.csv"
df <- read.csv(file_path, header=TRUE, stringsAsFactors=FALSE)
df$name <- NULL
names(df)
str(df)
measures <- c("id", "age")
dimensions <- setdiff(names(df), measures)
dimensions
# Strip every character outside the printable ASCII range (space through ~).
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]", replacement=""))
}
df["state"] <- data.frame(lapply(df["state"], toupper))
df$race <- gsub("^[W]", "WHITE", df$race)
df$race <- gsub("^[H]", "HISPANIC", df$race)
df$race <- gsub("^[B]", "BLACK", df$race)
df$race <- gsub("^[N]", "NATIVE AMERICAN", df$race)
df$race <- gsub("^[A]", "ASIAN", df$race)
df$race <- gsub("^[O]", "OTHER", df$race)
df["race"]
df$gender <- gsub("^[F]", "FEMALE", df$gender)
df$gender <- gsub("^[M]", "MALE", df$gender)
df["gender"]
head(df)
na2emptyString <- function (x) {
  x[is.na(x)] <- ""
  return(x)
}
if(length(dimensions) > 0) {
  for(d in dimensions) {
    # Change NA to the empty string.
    df[d] <- data.frame(lapply(df[d], na2emptyString))
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']", replacement=""))
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&", replacement=" and "))
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":", replacement=";"))
  }
}
na2zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
if(length(measures) > 0) {
  for(m in measures) {
    print(m)
    # Keep only digits, minus signs, and decimal points in measures.
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]", replacement=""))
    df[m] <- data.frame(lapply(df[m], na2zero))
    df[m] <- data.frame(lapply(df[m], function(x) as.numeric(as.character(x))))
  }
}
str(df)
write.csv(df, gsub("-data", "-cleaned", file_path), row.names=FALSE, na = "")
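The chained gsub calls used for the race column work because each one-letter code is replaced before a later pattern could match it. As a minimal sketch of an alternative, the same recoding can be written as a lookup in a named vector. The names `race_labels` and `recode_race` are hypothetical and do not appear in the project code.

```r
# Hypothetical lookup table; the labels mirror the ones used in the cleaning code.
race_labels <- c(W = "WHITE", H = "HISPANIC", B = "BLACK",
                 N = "NATIVE AMERICAN", A = "ASIAN", O = "OTHER")

# Map one-letter codes to labels. Codes that are not in the table
# (including the empty string) are returned unchanged.
recode_race <- function(codes) {
  out <- unname(race_labels[codes])
  missing_label <- is.na(out)
  out[missing_label] <- codes[missing_label]
  out
}

recode_race(c("W", "B", "", "H"))
```

A lookup like this does not depend on the order of the replacements, which may make it easier to extend if new codes show up in the source data.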
The U.S. Census Bureau and data.world have recently announced a partnership which has resulted in data.world hosting the Census Bureau's biggest annual household survey, the American Community Survey.
Through the official US Census Bureau data.world profile, census data will be offered for any person to use and analyze: https://data.world/uscensusbureau
We used the 2011-2015 Income of US Population Estimates data for this particular study. This dataset was found here: https://data.world/uscensusbureau/acs-2015-5-e-income
Using the data.world R package, the 2011-2015 Income of US Population dataset and the Fatal Police Shootings dataset were queried, pulled into RStudio, and combined using dplyr.
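To illustrate the combining step, here is a minimal sketch that matches state-level income rows to individual shooting records. It uses base R's merge as a stand-in for the dplyr join, and the two small data frames, their column names, and all values are invented for illustration; they are not the queried data.

```r
# Toy stand-ins for the two data sets; all values are invented for illustration.
shootings <- data.frame(
  id    = c(1, 2, 3),
  state = c("CA", "TX", "CA"),
  flee  = c("Car", "Foot", "Not fleeing"),
  stringsAsFactors = FALSE
)
income <- data.frame(
  State = c("CA", "TX", "FL"),
  GINI  = c(0.49, 0.48, 0.47),
  Median_Family_Income = c(70000, 62000, 57000),
  stringsAsFactors = FALSE
)

# Attach the state-level income columns to each shooting record.
# Only states present in both data frames are kept (an inner join).
combined <- merge(shootings, income, by.x = "state", by.y = "State")
combined
```

With dplyr, the analogous call would be inner_join(shootings, income, by = c("state" = "State")). Either way, every record from the same state receives the same income values.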
Here is a summary of the combined dataset:
summary(incomeOfTheFatallyShot)
X State GINI Per_Capita_Income
Min. : 1.00 CA :15 Min. :0.4181 Min. :22798
1st Qu.: 25.75 TX : 8 1st Qu.:0.4618 1st Qu.:25737
Median : 50.50 OR : 6 Median :0.4753 Median :26999
Mean : 50.50 AL : 5 Mean :0.4706 Mean :27979
3rd Qu.: 75.25 FL : 5 3rd Qu.:0.4801 3rd Qu.:30318
Max. :100.00 TN : 5 Max. :0.5317 Max. :47675
(Other):56
Median_Family_Income Median_Non_Family_Income Median_Income
Min. :51782 Min. :23027 Min. :41371
1st Qu.:57856 1st Qu.:28639 1st Qu.:47507
Median :62717 Median :31848 Median :51243
Mean :64588 Mean :33022 Mean :53300
3rd Qu.:70720 3rd Qu.:37909 3rd Qu.:61062
Max. :90089 Max. :61466 Max. :74551
id date manner_of_death armed
Min. : 3 2016-01-27: 8 shot :91 gun :55
1st Qu.:1164 2016-01-16: 6 shot and Tasered: 9 knife :16
Median :1188 2016-01-17: 5 toy weapon:11
Mean :1067 2016-01-18: 5 unarmed : 7
3rd Qu.:1219 2016-01-31: 5 vehicle : 7
Max. :1288 2016-02-04: 5 chain saw : 1
(Other) :66 (Other) : 3
age gender race city
Min. :12.00 FEMALE: 6 ASIAN : 2 Kansas City: 2
1st Qu.:28.00 MALE :94 BLACK :17 Mesa : 2
Median :35.50 HISPANIC :18 San Antonio: 2
Mean :36.91 NATIVE AMERICAN: 3 Acworth : 1
3rd Qu.:45.00 OTHER : 1 Albuquerque: 1
Max. :64.00 WHITE :58 Aloha : 1
NA's : 1 (Other) :91
signs_of_mental_illness threat_level flee body_camera
false:66 attack:57 Car :18 false:87
true :34 other :43 Foot :16 true :13
Not fleeing:63
Other : 3
The columns of GINI, Per_Capita_Income, Median_Family_Income, and Median_Non_Family_Income were added from the census data and matched on the state of the individual shot. This dataset was saved as a CSV for later use in Tableau and R Markdown.
To create interesting visualizations, we must first understand what the combined data means. The columns that were queried from data.world were GINI, Per_Capita_Income, Median_Family_Income, and Median_Non_Family_Income. These are state-level summaries, meaning that each fatal shooting carries the per capita income, median family income, and median non-family income of the state in which the shooting took place. Although this mixes individual-level data with general data, it sheds some light on the type of environment and socioeconomic context in which the shooting took place.
The Median Family Income vs. Fleeing plot is shown above. It is an example of a boxplot, colored by gender. Generally, the state-level median family income for the data points falls roughly between 55K and 70K. There are few outliers outside this range, meaning that the majority of data points are similar.
plot(boxplot)
This plot shows the median family income grouped by how the person fled from the police. We can see that the average of the state's median family income is higher for people who fled on foot than for the other ways of fleeing. This plot differs from the Tableau plot in that our R/Shiny version does not display individual dots. The sampling from the data is also different in Shiny, so the numbers differ; here we see a higher average median family income for people fleeing on foot.
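For readers who want to see the individual records on the R side as well, the following is a minimal sketch in base graphics. It is not the ggplot2/Shiny code behind the app, and the `toy` data frame and its values are invented for illustration.

```r
# Invented values for illustration only.
toy <- data.frame(
  flee = c("Car", "Car", "Foot", "Foot", "Not fleeing", "Not fleeing"),
  Median_Family_Income = c(58000, 62000, 66000, 70000, 57000, 63000)
)

# Mean of the state-level median family income per fleeing category.
group_means <- tapply(toy$Median_Family_Income, toy$flee, mean)
group_means

# Box plot per category, with the individual records overlaid as jittered points.
boxplot(Median_Family_Income ~ flee, data = toy,
        ylab = "Median family income (state level)")
stripchart(Median_Family_Income ~ flee, data = toy,
           vertical = TRUE, method = "jitter", add = TRUE, pch = 16)
```

Printing the group means next to the plot makes it easier to compare numbers between two tools when the underlying samples differ.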
This dual-axis histogram shows per capita income along the x axis. The left axis gives the count of records in each per capita income bin, and the right axis gives the average median income. The blue bars belong to the left axis and the orange dots to the right. An overall average line is also displayed, along with paging by quarter; the current quarter is Q1. It's interesting to note that the count peaks at a per capita income of 26K and that the average median income increases steadily with per capita income.
This is the same visualization as shown above except that this visualization also includes actions when selecting specific data from the histogram.
plot(histogram)
In this histogram, we plot the number of people shot by the police at each level of per capita income. We can see that the majority of people are from states that have low per capita income (less than 30K). This differs from our Tableau plot in that we were unable to produce a dual-axis plot in ggplot to show the average median income.
The map is a relatively simple example where each state is colored by the average median income. The darker colored states have a higher income than the lighter colored states. It's interesting to note that states closer to either coast appear to have higher incomes than states in the middle of the US.
This plot shows age vs. median income with average trendlines, colored by threat level. It's interesting to note that, for the trendline where the threat level was undetermined, age decreases as median income increases.
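Even without a second axis, the two series behind the Tableau view (the count per income bin and the average median income per bin) can be computed side by side. The sketch below assumes 5K-wide bins and uses invented values; the bin width and the `toy` data frame are illustrative choices, not taken from the project.

```r
# Invented state-level values attached to six hypothetical records.
toy <- data.frame(
  Per_Capita_Income = c(24500, 25800, 26200, 26900, 31000, 33500),
  Median_Income     = c(45000, 47000, 50000, 52000, 60000, 64000)
)

# Bin per capita income into 5K-wide intervals: [20K, 25K), [25K, 30K), ...
breaks <- seq(20000, 35000, by = 5000)
bin <- cut(toy$Per_Capita_Income, breaks = breaks, right = FALSE, dig.lab = 5)

# First series: number of records per bin (the bars).
counts <- table(bin)
# Second series: average median income per bin (the dots on the second axis).
avg_median <- tapply(toy$Median_Income, bin, mean)

data.frame(bin = names(counts), count = as.vector(counts),
           avg_median_income = as.vector(avg_median))
```

The resulting table could be shown next to the histogram or drawn as a second panel.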
Scatter Plot
This view builds on plots already described; it simply lists the details of the trend model from the scatter plot.
The dashboard combines two plots: a scatter plot example and a histogram example. These plots are also used for actions and are described further in other parts of the notebook. This page makes the action easier to see, since switching between workbook sheets is not required.
plot(scatterplot)
This scatterplot depicts the median family income and GINI index for each individual's state. It is colored by what the individual was armed with when shot by the police. We can see that there is no apparent correlation between median family income and GINI index, and that individuals from states with low to average median family income tended to be armed with guns. In Shiny, this graph is equipped with actions to zoom in on a set of points or on individual points.
This plot groups people who were shot by the police by how they fled and by whether they showed signs of mental illness, and shows their state's median family income. Of particular note is that people showing signs of mental illness who were fleeing on foot had a higher median family income.
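A visual impression of "no correlation" can be checked numerically. The following is a minimal sketch with invented values; it is not a result from the project data.

```r
# Invented state-level values for illustration only.
toy <- data.frame(
  Median_Family_Income = c(52000, 58000, 61000, 66000, 72000, 90000),
  GINI                 = c(0.47, 0.45, 0.48, 0.46, 0.49, 0.44)
)

# Pearson correlation between the two state-level measures.
r <- cor(toy$Median_Family_Income, toy$GINI)

# cor.test() adds a p-value and confidence interval for the same statistic.
result <- cor.test(toy$Median_Family_Income, toy$GINI)
c(r = r, p_value = result$p.value)
```

Because both measures are state-level, records from the same state share identical coordinates, so a correlation computed over records weights each state by its number of shootings.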
plot(fleePlot)
This is a barchart depicting median income by how people were fleeing, separated by signs of mental illness. The line shows the average of median incomes by mental illness and fleeing type.
The Gender vs. Race crosstab plot is shown above. This graph shows the race of the individual shot against the individual's gender. Each cell has the median income colored by race. A set was created by filtering for individuals who were shot in a state whose median family income was between 46,000 and 62,000. Where appropriate, the value for the set appears as the smaller text underneath the larger text in a cell. It is interesting to see that, in general, the average median income for females is higher.
plot(genderRacePlot)
This graph shows the race of the individual shot against the individual's gender. A set was created by using the dplyr filter function to separate out those individuals who were shot in a state whose median family income was between 46,000 and 62,000.
This is a graph of the gender of the individual shot vs. whether that individual showed signs of mental illness. The graph was then colored based on whether that individual was shot in an area with high, medium, or low per capita income. This is a good example of using a parameter.
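The set construction can be sketched as follows. This uses base R subsetting rather than the dplyr filter call mentioned in the text, treats both bounds as inclusive (an assumption), and runs on invented records.

```r
# Invented records for illustration only.
toy <- data.frame(
  race   = c("WHITE", "BLACK", "HISPANIC", "WHITE", "BLACK"),
  gender = c("MALE", "MALE", "FEMALE", "FEMALE", "MALE"),
  Median_Family_Income = c(57000, 61000, 70000, 48000, 45000),
  stringsAsFactors = FALSE
)

# The "set": records from states whose median family income lies in the range.
in_set <- toy$Median_Family_Income >= 46000 & toy$Median_Family_Income <= 62000
income_set <- toy[in_set, ]

# Race-by-gender counts for all records and for the set only.
table(toy$race, toy$gender)
table(income_set$race, income_set$gender)
```

In dplyr, the same subset could be written as filter(toy, between(Median_Family_Income, 46000, 62000)), where between() is also inclusive at both ends.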
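A parameter of this kind amounts to a pair of adjustable cut-offs. The sketch below shows one way to express that in R; the cut-off values and the names `low_cutoff`, `high_cutoff`, and `income_level` are hypothetical, and the three sample inputs are the minimum, median, and maximum of Per_Capita_Income from the summary of the combined dataset.

```r
# Hypothetical cut-offs standing in for the parameter values.
low_cutoff  <- 26000
high_cutoff <- 30000

# Classify a per capita income value as low, medium, or high.
# Intervals are closed on the left: [low_cutoff, high_cutoff) is "medium".
income_level <- function(per_capita_income) {
  cut(per_capita_income,
      breaks = c(-Inf, low_cutoff, high_cutoff, Inf),
      labels = c("low", "medium", "high"),
      right = FALSE)
}

income_level(c(22798, 26999, 47675))
```

Changing the two cut-offs and re-running the classification mimics adjusting the parameter in the interactive view.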
plot(raceFleePlot)
This R visualization was created using the calculated fields of Median(MedianFamilyIncome/PerCapitaIncome) and plotting based on how the individual fled against the race of the individual shot.
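As a rough sketch of that calculated field, the ratio can be computed per record and then summarized with the median inside each flee/race group. The records below are invented, and `income_ratio` is a hypothetical column name.

```r
# Invented records for illustration only.
toy <- data.frame(
  race = c("WHITE", "WHITE", "BLACK", "BLACK"),
  flee = c("Car", "Car", "Foot", "Foot"),
  Median_Family_Income = c(60000, 66000, 56000, 70000),
  Per_Capita_Income    = c(30000, 30000, 28000, 28000)
)

# Row-level ratio, then the median of that ratio within each flee/race group.
toy$income_ratio <- toy$Median_Family_Income / toy$Per_Capita_Income
ratio_by_group <- aggregate(income_ratio ~ flee + race, data = toy, FUN = median)
ratio_by_group
```

Taking the median of the ratios is not the same as dividing the group medians, so the order of operations matters when reproducing the field.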
The Median Income of Race Broken up by Gender is shown above. This graph is an example of a barchart with a table calculation. It combines our data set and the census data set through the variables of race, gender, and Median Income respectively. Multiple rows are used to break down each race by gender. A table calculation is used, specifically avg_median_income - window_avg_income. This calculation drives the color and the text, and helps show that males have relatively high values for it even though their avg_median_income on its own may be relatively low compared to that of females.
plot(incomeByRacePlot)
The Median Income of Race Broken up by Gender is shown above, specifically the R version. It combines our data set and the census data set through the variables of race, gender, and Median Income respectively. A facet is used to break down each race by gender. A table calculation is used, specifically avg_median_income - window_avg_income. The text for this calculation helps show that males have relatively high values for it even though their avg_median_income on its own may be relatively low compared to that of females. The numbers are slightly different due to limits on the SQL statement, which keep the process from taking too long to pull the data.
Individuals were plotted with their median income. A set was created for high-income individuals, meaning those whose state-level median income was over 60K. These individuals were then plotted with their state's GINI score, which is shown here. All of the individuals in this high-income set had roughly the same GINI score.
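The calculation avg_median_income - window_avg_income can be sketched in base R as below. How the window was partitioned is not stated, so this sketch assumes the window average is the mean of the cell averages within each race; the records are invented.

```r
# Invented records for illustration only.
toy <- data.frame(
  race   = c("WHITE", "WHITE", "WHITE", "BLACK", "BLACK", "BLACK"),
  gender = c("MALE", "MALE", "FEMALE", "MALE", "FEMALE", "FEMALE"),
  Median_Income = c(50000, 54000, 58000, 48000, 56000, 60000)
)

# Average median income for each race/gender cell.
cells <- aggregate(Median_Income ~ race + gender, data = toy, FUN = mean)
names(cells)[names(cells) == "Median_Income"] <- "avg_median_income"

# Window average: here taken as the mean of the cell averages within each race
# (an assumption about how the window was partitioned).
cells$window_avg_income <- ave(cells$avg_median_income, cells$race)
cells$difference <- cells$avg_median_income - cells$window_avg_income
cells
```

With this partitioning the differences within each race sum to zero, so the sign of the value shows whether a gender sits above or below the average for its race.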
plot(inequalityPlot)
This barchart shows the GINI inequality index of the states the individuals are from, using ID sets to separate out the high-income individuals.
All R visualizations were built in R Shiny, with each graph on its own tab. Here is a link to the published Shiny application: https://robin-stewart.shinyapps.io/final_project/
Online Shiny Application